Temporal Rules Discovery for Web Data Cleaning

Authors

  • Ziawasch Abedjan
  • Cuneyt Gurcan Akcora
  • Mourad Ouzzani
  • Paolo Papotti
  • Michael Stonebraker
Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a “clean” version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.

1. INTRODUCTION

With the increasing availability of web data, we are witnessing the proliferation of businesses engaged in automatic data extraction from thousands of web sources with the goal of gleaning useful information and intelligence about people, companies, countries, products, and organizations [30].
It is well recognized that the data cannot be used as-is because of errors that are in the sources themselves [15, 28, 29, 33] or that arise with automatic extractors [7, 13].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 4 Copyright 2015 VLDB Endowment 2150-8097/15/12.
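
The abstract's point that a duration must be part of a rule can be illustrated with a minimal sketch. This is not the discovery algorithm of the paper; it only checks one hand-written rule, and the function name `find_violations`, the tuple layout, the sample facts, and the six-hour duration are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import combinations


def find_violations(facts, min_gap):
    """Return pairs of facts about the same entity that report
    different values less than `min_gap` apart in time."""
    conflicts = []
    ordered = sorted(facts, key=lambda f: f[2])
    for a, b in combinations(ordered, 2):
        same_entity = a[0] == b[0]
        different_value = a[1] != b[1]
        too_close = abs(b[2] - a[2]) < min_gap
        if same_entity and different_value and too_close:
            conflicts.append((a, b))
    return conflicts


# Invented facts: (entity, reported location, report time).
facts = [
    ("person_1", "Rome", datetime(2015, 1, 10, 20, 0)),
    ("person_1", "Cape Town", datetime(2015, 1, 10, 20, 30)),
    ("person_1", "Rome", datetime(2015, 1, 20, 9, 0)),
]

# A rule without a duration treats every change of location as an error.
timeless = find_violations(facts, timedelta.max)

# A rule with an (assumed) six-hour duration flags only the reports
# that are too close together to both be true.
temporal = find_violations(facts, timedelta(hours=6))

print(len(timeless), len(temporal))
```

With the duration in place, the later report of the same person in Rome is no longer treated as a conflict, which is the kind of false alarm a purely non-temporal dependency would raise.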


Related resources

Web Services Discovery and Recommendation Based on Information Extraction and Symbolic Reputation

This paper shows that the problem of web services representation is crucial and analyzes the various factors that influence it. It presents the traditional representation of web services, which relies on textual descriptions based on the information contained in WSDL files. Unfortunately, textual web services descriptions are dirty and need significant cleaning to keep only useful inf...


Research of Data Cleaning Methods Based on Dependency Rules

This paper introduces the concept and principles of data cleaning, analyzes the types and causes of dirty data, proposes several key steps of a typical cleaning process, and puts forward a scalable and versatile data cleaning framework. For data with attribute dependency relations, it designs several violation discovery algorithms using formal formulas, which can obtain inconsiste...


Mining Association Rules in Temporal Document Collections

In this paper we describe how to mine association rules in temporal document collections. We describe how to perform the various steps in the temporal text mining process, including data cleaning, text refinement, temporal association rule mining and rule post-processing. We also describe the Temporal Text Mining Testbench, which is a user-friendly and versatile tool for performing temporal tex...


Expert Discovery: A web mining approach

Expert discovery is the search for an answer to the question: “Who is the best expert on a specific subject in a particular domain within a peculiar array of parameters?” An expert with domain knowledge in any field is crucial for consulting in industry, academia, and the scientific community. The aim of this study is to address the issues of the expert-finding task in real-world communities. Collabor...


Web-Log Cleaning for Constructing Sequential Classifiers

With millions of web users visiting web servers each day, the web log contains valuable information about users’ browsing behavior. In this work, we construct sequential classifiers for predicting the users’ next visits based on the current actions using association rule mining. The domain feature of web log mining entails that we adopt a special kind of association rules we call latestsubstrin...



Journal:
  • PVLDB

Volume 9

Publication date: 2015